
Journal of Computational Biology

SAGE Publications

Preprints posted in the last 30 days, ranked by how well they match Journal of Computational Biology's content profile, based on 37 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
On the benchmarking of clustering algorithms and hyperparameter influence for cell type detection in single-cell RNA sequencing data.

Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.

2026-05-17 bioinformatics 10.1101/2025.08.20.671270 medRxiv
Top 0.2%
1.5%

Clustering data from single-cell RNA-seq (scRNA-seq) and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We focus on proposing an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameter exploration on a carefully curated collection of real gold-standard datasets. By evaluating clustering performance across different hyperparameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly under suboptimal settings of numerical hyperparameters: the neighborhood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms.
While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.

2
Resolving the oak tree of life: comparing RADseq and whole genome resequencing methods for oak phylogenetics

Hipp, A. L.; Althaus, K. N.; Fuller, E. L.; Hahn, M.; Larson, D. A.; Mohn, R. A.; Wang, B.; Manos, P. S.

2026-05-17 evolutionary biology 10.64898/2026.05.14.725274 medRxiv
Top 0.3%
1.2%

Forest trees pose numerous potential challenges to phylogenomic inference. Their large effective population sizes and relatively long generation times lead to deep allele coalescence and consequently incomplete lineage sorting (ILS), which biases inferences of divergence times toward older ages and introduces gene tree discordance. Deep phylogenetic divergences, reaching back into the Paleocene, introduce reference-mapping biases. Introgression (the movement of genes between lineages) may result in different phylogenies being inferred depending on which individuals are included in analysis, even if the plurality of the genome favors the divergence history unaffected by introgression. These factors influence phylogenetic inference across the Tree of Life but are particularly prevalent in forest trees. Oaks (Quercus) are notable for all three influences. In addition, our knowledge of the oak phylogeny is currently based strongly on restriction site associated DNA sequencing (RADseq) datasets published over the past decade, which may introduce additional sources of uncertainty. In this chapter, we analyze a 322-species RADseq dataset and genome resequencing data from across the genus to address sources of uncertainty in our understanding of the global oak phylogeny, which we hope will serve as a model for other research groups working on comparable woody plant groups.

3
Evolutionary rate correlations reveal long-term co-evolutionary interactions in Drosophila melanogaster

Dagilis, A. J.; DiAngelis, B.; Lee, S.; Matute, D. R.

2026-05-23 evolutionary biology 10.64898/2026.05.21.726714 medRxiv
Top 0.4%
0.9%

Co-evolution between genes can occur for a variety of reasons, including co-expression of genes, epistatic interactions between them, physical interactions of gene products and many others. Co-evolutionary partners of a gene are therefore of great interest in identifying potential factors that contribute to any phenotype of interest. State-of-the-art approaches to detect these interactions use correlations of evolutionary rates across a broader phylogeny, and so by necessity identify interactions only among genes that are present across long evolutionary time periods. This makes the methods unwieldy when interest lies in a single focal organism in which the genes of interest may have evolved in the recent evolutionary past. Here, we present a new approach to calculating evolutionary rate correlations which focuses on extracting maximum coverage for a single focal species, while retaining signals of co-evolution across large clades. We show how this approach is able to identify potential interactions even in highly studied species and highly studied genes, with a focus on the D. melanogaster sex-determiner, Sxl, using data from 72 species of Dipterans.

4
Automatic Bevacizumab Response Prediction in Ovarian Cancer from Digital Pathology Images via Novel AI-based Computational Pipeline

Alsaiari, A.; Turki, T.; Taguchi, Y.-h.

2026-05-04 bioinformatics 10.64898/2026.04.29.721782 medRxiv
Top 0.4%
0.9%

Ovarian cancer is a gynecological cancer that, if it metastasizes and is not detected early, can be fatal. Therefore, there is a need to accurately predict drug response in ovarian cancer. A gynecological pathologist inspects tissues for abnormalities and then provides a report about the patient; however, this diagnostic process is (1) hard, (2) dependent on experience, and (3) time-consuming. Moreover, existing tools are far from perfect. Hence, we present a computational pipeline to improve the prediction of drug response in ovarian cancer, derived as follows. First, we download digital pathology images pertaining to ovarian bevacizumab response from The Cancer Imaging Archive repository. We apply histogram of oriented gradients to the images to construct feature vectors, which are passed to Fisher linear discriminant analysis for dimensionality reduction. Then, we pass the reduced-dimensionality data to support vector regression with various kernels and calculate the area under the ROC curve (AUC). Experimental results against transformer-based models (ViT and Swin) and other deep learning (DL) models (VGG16, ResNet50, InceptionV3, MobileNetV2, and EfficientNetB6) demonstrate that our approach with a radial kernel (named SVRD+R) yielded an AUC improvement of 17% over the best-performing transformer-based model (ViT) and an AUC improvement of 14.9% over the best DL-based model (MobileNetV2). These results demonstrate the superiority and feasibility of our AI-based pipeline when tackling prediction problems pertaining to gynecologic cancer studies. MSC: 92B05; 68T09

5
Cell Type Weighted Dimensionality Reduction

Putta, S.; Jensen, W.; Devakonda, S.; Pennell, L.; Croteau, J.

2026-05-05 bioinformatics 10.64898/2026.04.30.721796 medRxiv
Top 0.4%
0.9%

High-dimensional single-cell technologies, such as flow cytometry and CITE-Seq, typically rely on established lineage markers to define cell identities. Additional markers are commonly analyzed within the context of these predefined cell types. Nonlinear projection methods such as t-SNE and UMAP provide a visual framework for this analysis by enabling the overlay of cell types and marker expression. However, these methods frequently produce projections where distinct cell types substantially overlap, hindering interpretation of marker expression patterns relative to known cell types. In this study, we investigate the underlying causes of this phenomenon and demonstrate that such overlaps often stem from the inherent high-dimensional structure of the data rather than limitations in the dimensionality reduction algorithms themselves. To address this, we introduce Cell Type Weighted Dimensionality Reduction (CWDR), a novel approach that incorporates lineage-based information through a supervised weighting mechanism. By integrating both cell identity and marker expression, CWDR preserves the visual separation between predefined cell types while maintaining the local variance necessary for downstream analysis. We validate our method across multiple high-dimensional flow cytometry and proteogenomic datasets. Our results show that CWDR significantly reduces inter-cluster overlap compared to traditional methods, providing a clearer framework for visualizing marker expression within the context of specific cell lineages.

6
A Rarefaction Approach to Identify Local Introgression in a Three Population Tree

Smith, T. Q.; Szpiech, Z. A.

2026-05-16 evolutionary biology 10.64898/2026.05.13.724952 medRxiv
Top 0.5%
0.8%

Patterson's D statistic, also known as the ABBA-BABA statistic, is widely used to detect the presence of archaic genome-wide introgression between two non-sister taxa. Requiring only a single lineage from each of four taxa, where one taxon acts as an outgroup to determine the ancestral allele, Patterson's D measures the imbalance between the number of biallelic sites where the second and third taxa share the derived allele (ABBA sites) and the number where the first and third taxa share it (BABA sites). When there is no introgression, these counts are expected to be equal, and a discordance between counts suggests introgression from the third taxon into either the first or second. Patterson's D is limited to the detection of genome-wide introgression and exhibits a high false-positive rate when applied to smaller genomic segments. Here, we present a new method, D STatistic with Allelic Rarefaction (D*), to address these limitations. D* uses multiple lineages and does not require an outgroup to calculate the imbalance between the number of alleles found exclusively in the second and third taxa and the number of alleles found exclusively in the first and third taxa. D* employs a rarefaction technique to correct for unequal sample sizes and allows multiallelic sites. We use simulations to show that D* has better precision and recall for detecting introgressed segments of DNA when compared to similar methods under a wide variety of model parameters and in the presence of technical artifacts common to ancient DNA analyses. We conclude with an analysis of Denisovan DNA introgression in modern-day Papuans. Precompiled executables, the manual, and source code can be found at https://github.com/TQ-Smith/DSTAR
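
For orientation, the classical statistic that D* builds on can be sketched in a few lines of Python. This is only a minimal illustration of the standard ABBA-BABA count, D = (ABBA - BABA) / (ABBA + BABA); it is not the D* method and does not reproduce the DSTAR code. The function name `patterson_d` and the input layout (one allele each for taxa P1, P2, P3 and the outgroup) are assumptions made for the example.

```python
def patterson_d(sites):
    """Count ABBA/BABA sites and return (abba, baba, D).

    `sites` is an iterable of (p1, p2, p3, outgroup) allele tuples,
    one sampled lineage per taxon (hypothetical input layout).
    """
    abba = 0
    baba = 0
    for p1, p2, p3, out in sites:
        # The outgroup defines the ancestral allele; P3 must be derived.
        if p3 == out:
            continue
        # Classical D only uses biallelic sites.
        if len({p1, p2, p3, out}) != 2:
            continue
        if p2 == p3 and p1 == out:
            abba += 1  # P2 and P3 share the derived allele
        elif p1 == p3 and p2 == out:
            baba += 1  # P1 and P3 share the derived allele
    total = abba + baba
    d = (abba - baba) / total if total else float("nan")
    return abba, baba, d


example_sites = [
    ("A", "G", "G", "A"),  # ABBA
    ("G", "A", "G", "A"),  # BABA
    ("A", "G", "G", "A"),  # ABBA
    ("A", "A", "G", "A"),  # derived only in P3: uninformative
    ("A", "A", "A", "A"),  # invariant: uninformative
]
abba_count, baba_count, d_value = patterson_d(example_sites)
```

Under no introgression the two counts are expected to match and D is close to zero; in practice, significance is usually assessed with a resampling scheme over genomic blocks, which this sketch omits.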

7
Novel linkage disequilibrium-based genotype-by-environmental interaction method for genomic prediction of cotton yield and fibre quality traits

Li, Z.; Li, X.; Liu, S.; Wilson, I.; Zhu, Q.-H.; Stiller, W.; Conaty, W.

2026-05-06 plant biology 10.64898/2026.05.03.722538 medRxiv
Top 0.7%
0.6%

Genomic prediction (GP) across diverse environments has the potential to accelerate genetic gain in cotton breeding programs. A major challenge in GP is modelling genotype-by-environment interactions (GEI), which is essential for selecting stable and high-performing genotypes under variable production conditions. However, incorporating GEI into GP models increases dimensionality and computational complexity, risking models that are impractical to use on commercial breeding-scale data sets because of run times and computational demands. This study addresses two primary aims. First, we evaluate the practical benefits of GEI-informed GP for predicting economically important cotton traits. Second, we develop and assess advanced statistical modelling strategies for integrating genomic and environmental data at scale. We propose a dimensionality reduction approach that combines linkage disequilibrium network analysis with principal component techniques to reduce redundancy while preserving informative variation. Using this reduced dataset, we implement Bayesian linear regression models and, for comparison, deep residual neural networks for genomic prediction. Analyses were conducted on a large multi-environment dataset from the CSIRO cotton breeding program, comprising 3,236 breeding lines, 54 environmental covariates, and 8,049 yield and fibre quality phenotype records collected over 10 years and 9 locations representing 41 year-location combinations. Results demonstrate that Bayesian linear regression approaches generally outperform BG-BLUP models, with all three linear/linear mixed methods providing clearly more reliable performance than the deep learning models. These findings highlight the value of using interpretable statistical models for integrating genomic and environmental information to support selection decisions under diverse environmental conditions.

8
Substitution rate variation, not hidden paralogy, drives false hybridization signal in phylogenetic network inference

Li, B.; Ane, C.

2026-05-18 evolutionary biology 10.64898/2026.05.11.723986 medRxiv
Top 0.8%
0.5%

Phylogenetic network inference methods are increasingly used to detect hybridization and gene flow from genomic data, but their robustness to common sources of model violation remains poorly characterized. We conducted a simulation study to evaluate the effects of hidden paralogy and substitution rate variation on two widely used network inference methods: find_graphs from ADMIXTOOLS 2 and SNaQ. Using an eight-taxon species tree calibrated from an empirical reptile phylogeny, we simulated data under various levels of hidden paralogy (from none to strong) and three levels of rate variation (none, gene-specific, and lineage-specific). We found that hidden paralogy had limited impact on network inference under the conditions examined: both network methods correctly favored a tree without reticulation, and ASTRAL recovered the correct species tree every time. In contrast, lineage-specific rates severely biased find_graphs, inflating worst f-statistic residuals well beyond the standard acceptance threshold. SNaQ correctly selected a tree model almost always across all conditions, though its network with h = 1 reticulation displayed the true species tree with a lower probability under lineage-specific rates. We also show that the standard worst residuals threshold of 3 for find_graphs produces inflated type I error even without rate variation, and we recommend empirical calibration of this threshold within each study system.

9
Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics 10.64898/2026.05.15.725340 medRxiv
Top 0.8%
0.5%

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean area under the receiver operating characteristic curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.

Author Summary: Protein-protein interactions are fundamental in all biological processes. However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned the primary sequence and characters related thereto, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of amino acid frequency features alone with hybrid features, across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among the four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
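
The "hybrid feature" idea described above can be illustrated with a small Python sketch. It computes the 20-dimensional amino acid frequency vector of a sequence and concatenates it with a latent vector that would, in the study, come from the convolutional autoencoder. The abstract does not say how the two proteins of a pair are combined, so the simple concatenation in `hybrid_pair_features`, the function names, and the placeholder latent vectors are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_frequencies(sequence):
    """Return the relative frequency of each standard amino acid."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def hybrid_pair_features(seq_a, seq_b, latent_a, latent_b):
    """Concatenate frequency and latent features for a protein pair.

    latent_a / latent_b stand in for autoencoder embeddings
    (hypothetical placeholders; any fixed-length vectors work here).
    """
    return (
        amino_acid_frequencies(seq_a)
        + amino_acid_frequencies(seq_b)
        + list(latent_a)
        + list(latent_b)
    )


freqs = amino_acid_frequencies("AAC")
pair_vector = hybrid_pair_features("AAC", "MKV", [0.1, 0.2], [0.3, 0.4])
```

A vector like `pair_vector` is the kind of input a random forest or other classifier could be trained on; the actual feature dimensions and pairing scheme used in the preprint may differ.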

10
Gene family evolutionary dynamics reveal convergent genomic signatures in pancrustacean metamorphosis

Campli, G.; Chipman, A. D.; Waterhouse, R. M.

2026-05-08 evolutionary biology 10.64898/2026.05.06.723392 medRxiv
Top 0.9%
0.5%

Arthropods exhibit an exceptional diversity of life histories, where developmental modes involve moulting stage progressions with changes ranging from the bare minimal to the dramatically transformative. While this variability drives many research questions aiming to understand evolutionary and developmental underpinnings of life history differences, it can complicate comparative analyses across taxa. However, this can be approached by applying a framework that defines metamorphosis as a post-embryonic stage progression characterised by substantial changes in morphology and adaptive landscape. Employing this framework with a phylogenomic dataset spanning 26 orders and encompassing four independently arising metamorphic lineages, we explore gene repertoire evolutionary dynamics potentially associated with metamorphosis in Pancrustacea. The approach contrasts gene family evolutionary dynamics inferred to have occurred in the last common ancestors of the metamorphic Insecta, Copepoda, Eucarida, and Thecostraca, with those of their sister lineages, as well as of descendent and ancestral nodes. The results reveal that the metamorphosis ancestors are characterised by an elevated number of gene family births and expansions. Expanded gene families share a set of commonly enriched biological processes across all metamorphosis ancestors, suggesting functional convergence by independent evolution of distinct gene families involved in embryonic and post-embryonic development and nervous system differentiation. Evolutionary modelling further highlights a subset of these families exhibiting signatures of adaptive, lineage-specific gene family size increases associated with metamorphic development. These families include genes implicated in neural and sensory development, segmentation, and moulting. 
These findings support a model of the evolution of pancrustacean metamorphosis where distinct gene families from a common functional toolkit expand and are co-opted into facilitating transitions to multi-phasic life cycles. This reframes the role of moulting in arthropod diversification to be recognised as an important reservoir of genetic change that can potentiate truly remarkable life history transitions.

11
Increasing Phenomic Prediction Efficiency Using A Principal Component Analysis Based Pre-Processing Of Near Infrared Spectra

Bienvenu, C.; Roger, J.-M.; Sene, M.; Castro Pacheco, S. A.; Singer, M.; Felaniaina, B. L.; Terrier, N.; De Bellis, F.; Pot, D.; DE VERDAL, H.; Segura, V.

2026-05-13 genetics 10.64898/2026.05.10.724118 medRxiv
Top 0.9%
0.5%

Phenomic prediction (PP) is a breeding value prediction method using near infrared spectroscopy (NIRS). Spectra pre-processing is a key step in the analysis pipeline of PP and generally involves chemometrics methods. However, there is still little understanding in the genetics community of what pre-processing does and why it improves performance. Consequently, the choice of pre-processing is made either arbitrarily or through a search for the optimal set of methods and associated parameters. In this study, we propose a PCA-based pre-processing method where genetic values of spectra are estimated on a set of principal components instead of individual wavelengths. This way, estimations are based on a few informative and orthogonal features of spectra instead of many correlated, uninformative wavelengths. We tested this new pre-processing method on five data sets representing four plant species (maize, rice, sorghum and grapevine). Results show that it performs as well as, or better than, the best classical chemometric pre-processing methods in almost all cases. Combining PCA-based and classical chemometric pre-processing methods maximizes predictive ability. Moreover, this pre-processing method opens up possibilities of better understanding and selecting parts of the spectral information that are relevant for the prediction of breeding values. Indeed, components representing together about 1% of spectral variability were found to be responsible for most of PP predictive ability.

Plain language summary: Cultivated plants are the result of a breeding process during which their genetic values are used to select those to breed. Estimation of breeding values requires heavy experimental means and is time consuming. Phenomic prediction is a low cost and high throughput genetic value estimation method that is increasingly being used. It often uses near infrared spectroscopy measurements as predictors of genetic values that are easy to collect and thus routinely used in many species. However, near infrared spectra generally require pre-processing before being used in prediction. Currently used pre-processing methods arise from the chemometrics community, and still deserve a better in-depth appropriation by geneticists. In this study, we propose a new pre-processing approach that performs as well as or better than the best chemometric pre-processing generally used, reduces computation time, and allows for a better understanding of what parts of spectral information are relevant for prediction.

Core Ideas:
- Working on principal components of spectra instead of wavelengths increases predictive ability of phenomic prediction and performs as well as or better than classical chemometrics pre-processing
- Working on principal components of spectra requires less optimization of parameters than chemometrics pre-processing
- About 1% of spectral variance is responsible for most of the predictive power of phenomic prediction
- Working on principal components of spectra pre-processed with classical chemometrics pre-processing can increase predictive ability even more
- PCA-based methods are valuable to optimize predictive ability of phenomic prediction and could be used more widely in the quantitative genetics field

12
Clustering Strategies Improve Structure-Preserving Visualization of Single-Cell RNA-seq Data with CBMAP

Alchaar, M.; Dogan, B.

2026-05-04 bioinformatics 10.64898/2026.04.30.721861 medRxiv
Top 1.0%
0.4%

Dimensionality reduction for visualization is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis due to the extremely high dimensionality of gene expression profiles. However, widely used nonlinear embedding techniques such as UMAP and t-SNE can introduce substantial distortions when projecting data into two-dimensional space, potentially altering global organization, local neighborhoods, and distance relationships in ways that may mislead downstream biological interpretation. In this study, we investigate the applicability of Clustering-Based Manifold Approximation and Projection (CBMAP) for the visualization of scRNA-seq data and systematically examine how clustering strategies influence the quality of the resulting embeddings. CBMAP was integrated with several clustering algorithms commonly used in single-cell analysis, including k-means, Leiden, HDBSCAN, Secuer, HGC, and FlowSOM. The resulting embeddings were evaluated using quantitative metrics that measure global, local, and distance-level structure preservation and were compared with widely used dimensionality reduction methods such as UMAP, t-SNE, and PaCMAP across multiple benchmark datasets. Our results demonstrate that the clustering stage plays a critical role in determining the structural fidelity of CBMAP embeddings. Clustering algorithms specifically designed for single-cell transcriptomic data, particularly Secuer, produced more consistent preservation of global relationships between cell populations. Across multiple datasets, CBMAP more faithfully preserved global structural organization and inter-population distance relationships than the compared methods, although local neighborhood preservation was generally weaker than in techniques optimized for local structure. Importantly, CBMAP embeddings retained biologically meaningful relationships in trajectory benchmark datasets. 
When combined with RNA velocity analysis, CBMAP successfully preserved cyclic progenitor states and branching differentiation trajectories, demonstrating compatibility with trajectory-aware visualization. These findings indicate that CBMAP provides a structure-faithful visualization framework for scRNA-seq data and that clustering selection plays a central role in determining embedding quality.

13
Promises and limitations of local ancestry inference in imputed ancient genomes

Bougiouri, K.; Irving-Pease, E. K.; Frantz, L. A. F.; Racimo, F.; Petr, M.

2026-05-20 evolutionary biology 10.64898/2026.05.19.725905 medRxiv
Top 1%
0.4%

Recent advances in genome imputation have enabled the application of state-of-the-art statistical methods (originally developed for present-day genomes) to ancient genomes. One class of such methods, known as local ancestry inference (LAI), can model an individual's genome as a mosaic of tracts assigned to different putative ancestral sources, revealing patterns of genetic ancestry across the genome. However, most LAI methods have been designed to study recent admixture events in human history, and they generally assume large panels of present-day genomes. Despite the recent availability of high-quality imputed ancient genomes, it remains unknown to what degree LAI is reliable for such datasets. Ancient DNA is often characterized by heterogeneous geographic and temporal sampling, varying degrees of divergence between ancient source proxies and admixing populations, and complex demographic histories. Here, we performed an extensive set of population genetic simulations to evaluate the accuracy of four popular LAI methods (RFMix, FLARE, MOSAIC and simpLAI) under different demographic scenarios, various temporal sampling schemes, sample sizes, and admixture dates. We quantify the accuracy of these methods as a function of different parameters in practically relevant scenarios, and provide general guidelines for future studies utilizing LAI in ancient DNA research.

14
Machine Learning Analysis to Define Cell Lineage in Leiomyosarcoma

van IJzendoorn, D. G. P.; Przybyl, J.; Hastie, T.; Bovee, J. V. M. G.; Matusiak, M.; van de Rijn, M.

2026-05-12 cancer biology 10.64898/2026.05.08.723931 medRxiv
Top 1%
0.3%

Introduction: Cellular differentiation and lineage commitment are known to be associated with differences in DNA methylation. Leiomyosarcoma (LMS) is a tumor thought to originate from smooth muscle cells in the walls of vessels in the soft tissue (STLMS) or from the uterine myometrium (ULMS). Here, we identify the methylation signatures of normal smooth muscle cells from blood vessels and the uterine wall and compare these with those found in STLMS and ULMS. We hypothesized that these methylation signatures could be used to assign a smooth muscle subtype of origin to individual leiomyosarcomas, and that tumors of different origin would show biological differences with potential therapeutic relevance.

Methods: To define methylation profiles for smooth muscle from vessel walls versus those found in myometrium, EPIC methylation profiling was performed on DNA from 49 formalin-fixed paraffin-embedded (FFPE) normal smooth muscle samples. A supervised machine learning algorithm (Random Forest) was used to distinguish the methylation patterns of normal smooth muscle cells in vessel walls from those in the myometrium. The resulting classifier was applied to methylation data on 67 cases of LMS with corresponding bulk RNAseq data to identify which tumors showed a methylation signature most consistent with either blood vessel wall (LMSvessel) or myometrial smooth muscle (LMSwall). A custom signature matrix derived from scRNAseq data from 6 samples of LMS was used in CIBERSORTx analysis to compare the cellular composition of LMS cases with a vessel or uterine wall methylation signature.

Results: A high degree of correlation was found between the known site of origin for LMS (STLMS vs ULMS) and the methylation signature derived from different types of normal smooth muscle. LMSwall tumors compared to LMSvessel tumors had significantly higher activation of the PD-1 checkpoint pathway in RNAseq analysis. Digital flow cytometry by CIBERSORTx analysis showed an increased expression of transcriptomic signatures of several immune cell subtypes in LMSvessel tumors.

Conclusion: Using a supervised machine learning approach, we classified LMS samples as showing a high similarity in methylation patterns to normal smooth muscle cells of either the vessel wall or the myometrium. We found a correlation between LMS showing either a "vessel" or "muscle wall" methylation signature and their site of origin, but notably we also identified some exceptions. When classified based on their methylation signature, LMSwall and LMSvessel differed in their PD-1 pathway activation and in their predicted immune cell populations, suggesting potential implications for immunotherapeutic approaches.

15
Modeling Site-Specific Mutation Patterns in Pandemic-Scale Phylogenetics

Martin, S.; Ly-Trong, N.; Minh, B. Q.; Goldman, N.; De Maio, N.

2026-05-04 evolutionary biology 10.64898/2026.04.30.721865 medRxiv
Top 1%
0.3%

Models of genome evolution often account for different evolutionary rates at different genome positions due to, e.g., varying selective pressures or mutation rates. Recent evidence from millions of publicly shared SARS-CoV-2 genomes has revealed a more complex mutational landscape than can be modeled with existing approaches. Here, mutation rates are in fact not only highly position-specific, as currently modeled, but also nucleotide-specific; for example, specific mutations can occur very often at certain determined genome positions, while at the same positions other mutations might not be highly recurrent. Here, we propose and investigate a general model of genome evolution where each genome position is allowed to evolve under an independent, non-normalized substitution rate matrix describing site-specific rates of all mutation types ("Site-Specific Matrix" model, or SSM). We implement SSM in the efficient pandemic-scale phylogenetic inference software CMAPLE. Large-scale genomic epidemiological simulations suggest that, given enough data, SSM can accurately infer position- and nucleotide-specific substitution rates for more frequently observed nucleotides (typically the reference nucleotide), while other rates require higher levels of divergence. Simulations also show that SSM has a modest impact on the accuracy of phylogenetic tree estimation. We use SSM to analyze the evolution of millions of SARS-CoV-2 genomes and observe substantial mismatches between the substitution rates of classical rate variation models and our SSM estimates. These results suggest that classical models of rate variation are inadequate for modeling site-specific mutation patterns and that SSM is a useful alternative for large-scale genome analyses.
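
As a rough illustration of the per-site matrix idea described in this abstract, the Python sketch below builds a non-normalized substitution rate matrix for a single genome position and converts it to substitution probabilities over a short branch. The rate values, the branch length, and all function names are hypothetical and chosen only for illustration; this is not CMAPLE's implementation, and the matrix exponential is approximated with a plain truncated Taylor series.

```python
# Illustrative sketch only: a toy per-site, non-normalized rate matrix.
# All rate values are made up; this is not CMAPLE's implementation.

NUCS = "ACGT"

def uniform_rates(rate):
    """Same rate for every substitution type at one site."""
    return {(a, b): rate for a in NUCS for b in NUCS if a != b}

def make_rate_matrix(rates):
    """Build a 4x4 matrix Q from off-diagonal rates {(from, to): rate}.

    Each diagonal entry is minus the sum of the rest of its row, so rows
    sum to zero. Q is left non-normalized: its overall scale carries the
    site's absolute mutation rate.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for (src, dst), rate in rates.items():
        q[NUCS.index(src)][NUCS.index(dst)] = rate
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probs(q, t, terms=30):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series."""
    qt = [[x * t for x in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        term = mat_mul(term, qt)                      # (Qt)^n / (n-1)!
        term = [[x / n for x in row] for row in term]  # (Qt)^n / n!
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    return result

# A "background" site versus a site with one highly recurrent mutation.
background = make_rate_matrix(uniform_rates(0.1))
hot_rates = uniform_rates(0.1)
hot_rates[("C", "T")] = 5.0   # hypothetical recurrent C->T at this position
hot_site = make_rate_matrix(hot_rates)

branch_length = 0.01
p_bg = transition_probs(background, branch_length)
p_hot = transition_probs(hot_site, branch_length)
ci, ti = NUCS.index("C"), NUCS.index("T")
print("P(C->T) background:", p_bg[ci][ti])
print("P(C->T) hot site:  ", p_hot[ci][ti])
```

In this toy setting only the C->T entry differs between the two sites, which is the kind of nucleotide-specific pattern that a single per-site rate multiplier cannot express.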

16
A new method based on genome alignments provides a highly resolutive target enrichment set for weevils (Coleoptera, Curculionoidea)

ZELVELDER, B.; BENOIT, L.; LOISEAU, A.; HARAN, J.; ALLIO, R.

2026-05-13 evolutionary biology 10.64898/2026.05.09.724036 medRxiv
Top 1%
0.3%

Target enrichment methods have provided unprecedented advances in phylogenomics. Targeting hundreds of conserved regions has proven to be a good tradeoff between cost and efficiency, while being useful for museomics and diversified non-model clades. Unfortunately, current methods used for identifying such regions involve high degrees of conservation within targeted elements, usually pushing researchers to rely on flanking data with little guarantee for homology. The growing number of high-quality genomes available throughout the Tree of Life opens new opportunities to improve marker selection. In this study, we introduce GABBI, a new method for designing target capture probes by taking advantage of genome alignments, avoiding the selection of a single reference genome that can cause notable biases. We compare GABBI-derived markers to the most commonly used probe design method, PHYLUCE, at two taxonomic scales, the weevil superfamily Curculionoidea and the tribe Pachyrhynchini. At both taxonomic scales, results show that our new method allows identifying more variable loci that prove to be more phylogenetically resolutive than the PHYLUCE-derived ones. In doing so, we provide the first probe set specifically designed for weevils, targeting a wide set of 4,255 shared homologous regions, encouraging future research on systematics and macroevolution of one of the most diverse and economically important groups of insects. By providing GABBI as an automated and open-access pipeline, we hope to open new probe design opportunities to other taxonomic groups that face similar phylogenetic obstacles.

17
HiCPEP: Efficient estimation of chromatin compartment PC1 from Hi-C covariance structure

Cheng, Z.-R.; Chang, J.-M.

2026-05-18 bioinformatics 10.64898/2026.05.14.725269 medRxiv
Top 1%
0.3%

Principal component analysis (PCA) of the Hi-C Pearson correlation matrix is the standard approach for identifying A/B chromatin compartments. Despite its widespread use, the relationship between the first principal component (PC1) and the underlying compartment structure remains insufficiently characterized, and computing PC1 can become computationally expensive for high-resolution Hi-C data. Here we investigate the role of the PC1 explained variance ratio in compartment analysis and show that chromosomes with strong compartment organization typically exhibit a dominant PC1 signal. Based on this observation, we propose HiCPEP, a heuristic algorithm that estimates the sign pattern and relative magnitude of PC1 directly from the Hi-C Pearson covariance matrix without performing explicit eigenvector decomposition. The method can operate from either a dense Pearson matrix for fast approximation or a sparse observed/expected (O/E) matrix to reduce memory usage. Furthermore, because many covariance columns exhibit PC1-like patterns when the compartment signal is strong, HiCPEP can be accelerated using random sampling without substantially reducing accuracy. Across multiple Hi-C datasets, HiCPEP consistently recovered compartment patterns with high similarity to reference PC1 vectors produced by standard PCA-based methods. Benchmark experiments show that HiCPEP achieves comparable accuracy while reducing computational cost in terms of runtime or memory usage. These results suggest that HiCPEP provides a practical alternative for efficient chromatin compartment analysis from large-scale Hi-C datasets. The HiCPEP implementation is freely available at https://github.com/ZhiRongDev/HiCPEP.
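
The observation that many covariance columns look like PC1 can be made concrete with a small example. If a covariance matrix is dominated by one eigenvector v, so that C is approximately lambda * v * v^T, then column j is approximately lambda * v_j * v, a scaled copy of v. The Python sketch below illustrates only that observation on a toy matrix. Picking the column with the largest norm is an assumption made here for illustration, as are the toy vector and all function names; it should not be read as the HiCPEP algorithm itself.

```python
import math

def column(matrix, j):
    return [row[j] for row in matrix]

def pick_pc1_like_column(cov):
    """Return the covariance column with the largest Euclidean norm.

    Assumption for illustration: when PC1 dominates, this column is
    roughly proportional to PC1, so its signs give the compartment calls.
    """
    n = len(cov)
    norms = [math.sqrt(sum(x * x for x in column(cov, j))) for j in range(n)]
    best = max(range(n), key=lambda j: norms[j])
    return column(cov, best)

def signs(vec):
    return [1 if x >= 0 else -1 for x in vec]

# Toy rank-one-dominant covariance: lam * v v^T plus a small symmetric perturbation.
v = [0.5, 0.4, 0.3, -0.3, -0.4, -0.5]   # hypothetical "true" PC1
lam = 10.0
n = len(v)
cov = [[lam * v[i] * v[j] + 0.01 * math.cos(i + j) for j in range(n)]
       for i in range(n)]

estimate = pick_pc1_like_column(cov)

# PC1 is defined only up to a global sign flip, so accept either orientation.
same = signs(estimate) == signs(v)
flipped = signs(estimate) == [-s for s in signs(v)]
print("sign pattern recovered:", same or flipped)
```

No eigendecomposition is performed here; the cost is one pass over the matrix to compute column norms, which is the kind of saving the abstract describes for strongly compartmentalized chromosomes.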

18
Deep analysis of FANTOM CAGE data reveals hierarchical patterns of TSS co-deployment hubs and their disruption in cancers

Meduri, R.; Satish, A. L.; Singh, U.

2026-05-18 genomics 10.64898/2026.05.15.725323 medRxiv
Top 1%
0.3%

Selective deployment of multiple transcription start sites is a major regulatory feature of human transcriptomes. FANTOM CAGE data exhibit a near-universal TSS deployment parsimony which is disrupted in cancers. We have recently shown that TSS deployment is sensitive to gene function, futile upstream transcription, and cellular biosynthetic states. Patterns in FANTOM CAGE data can reveal mechanisms underlying TSS co-deployments. We propose and test the possibility that some TSSs act like epromoters and act as co-varying hubs of transcriptional activities for multiple other promoters. Using deep analysis of CAGE data implemented through neural networks, we show that non-cancers implement transcription co-deployments through cores of epromoter-like TSSs which are generally proximal to their start codons. These TSSs show enhancer-like TFBS profiles. A comparison with cancer CAGE data shows that the concentrated epromoter core is disrupted in cancers, with multiple distal TSSs replacing the proximal TSS cores. We provide evidence that the core TSSs are rich in YY1 and CTCF binding sites and associated with genes coding for transcription factors. Our findings show that covariance of TSS deployment is sensitive to transcriptional resource cost and that a parsimonious design of TSS co-deployments depends on proximal TSSs in non-cancers, a mechanism grossly disrupted in cancers. Highlights: (1) Heterogeneous FANTOM CAGE data contains universal patterns of TSS co-deployments. (2) TSS co-deployments exhibit a parsimonious "core-covariant" scheme which is disrupted in cancers. (3) Core TSSs are enriched in transcription factor binding sites and gene functions which justify biological features of the samples. (4) The DL pipeline we present identifies the core-covariant TSS sets in an unbiased manner.

19
From time-course expression to gene regulation: direct linear ODE inference without finite-difference approximation

Huang, X.; Ang, A.; Vasoya, A. P.; Wang, Y.; Teresa, P.

2026-05-20 systems biology 10.64898/2026.05.18.726023 medRxiv
Top 1%
0.3%

Inferring gene regulation from time-course expression profiles is essential for understanding how cells transition between states during development, differentiation, and disease progression. Existing approaches often model expression dynamics with ordinary differential equations (ODEs). However, due to the computational complexity of directly solving these ODE models, most methods rely on finite-difference approximations of temporal derivatives, which can amplify measurement noise, introduce discretization bias, and lead to unstable or biased parameter estimates. To fill this gap, we develop the first computational method to directly learn a linear ODE model for gene regulation inference without relying on finite-difference approximations. We first formulate an optimization problem that directly exploits the closed-form solution of the linear ODE system. We then solve this problem via gradient descent, deriving analytical gradients with respect to the model parameters; these gradients involve matrix exponentials and integrals, which are challenging to compute directly. To make the computation efficient, we further use high-order Taylor approximations of the gradients whose truncation error is on the order of machine precision. In addition, we establish theoretical results demonstrating an inherent, non-vanishing gap between our exact solution and solutions derived from finite-difference approximations, which underscores the theoretical advantages of our approach. Finally, we demonstrate that our method consistently outperforms competing approaches on both simulated data and real-world scRNA-seq datasets in terms of AUROC. Our source code is available at https://github.com/EJIUB/ExactLinearODE
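
The contrast between finite-difference and closed-form fitting can be seen in a one-dimensional toy case. For the scalar ODE x'(t) = a x(t), the closed-form solution is x(t) = x0 * exp(a t). The Python sketch below, with hypothetical values for a, x0 and the sampling interval, estimates a from two noise-free samples in both ways. It is a scalar illustration of the discretization bias only; it is not the authors' gradient-based method and does not reproduce their theoretical result.

```python
import math

a_true = -0.5   # hypothetical regulation rate
x0 = 2.0        # hypothetical initial expression level
dt = 1.0        # hypothetical sampling interval

# Noise-free sample from the closed-form solution x(t) = x0 * exp(a * t).
x1 = x0 * math.exp(a_true * dt)

# Finite-difference route: replace dx/dt with a forward difference,
# then solve dx/dt = a * x for a at the first time point.
a_fd = (x1 - x0) / (dt * x0)

# Closed-form route: invert x1 = x0 * exp(a * dt) directly.
a_exact = math.log(x1 / x0) / dt

print(f"true={a_true:.4f}  finite-difference={a_fd:.4f}  closed-form={a_exact:.4f}")
```

Even with perfect data, the forward-difference estimate equals (exp(a * dt) - 1) / dt rather than a, so it is biased for any fixed sampling interval, whereas inverting the closed form recovers a up to floating-point error.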

20
Discriminative learning of substitution matrices and gap penalties for pairwise alignment of biological sequences

Ciach, M. A.; Zacharopoulou, E.; Startek, M. P.; Miasojedow, B.; Alexiou, P.

2026-05-18 bioinformatics 10.64898/2026.05.14.725168 medRxiv
Top 1%
0.3%

Pairwise alignment scores are used to classify pairs of sequences in many areas of bioinformatics, including homology search, predicting interactions, or read mapping. The relative scores of different pairs strongly depend on the choice of a substitution matrix and gap penalties, but the existing approaches for the estimation of these parameters do not directly optimize them for the task of classification. In this work, we present DiscrimAlign, a statistical model for discriminative learning of substitution matrices and gap penalties from a dataset of positive and negative pairs of unaligned biological sequences. The model links the alignment score of a sequence pair with the associated binary label through a logistic function and learns the parameters by likelihood maximization. We analyze theoretical properties of the model, derive and implement a learning procedure, study its performance in simulated experiments, and apply it to predict microRNA-target interactions. We show that sequence alignment with discriminative substitution matrices and gap penalties predicts the interactions comparably to state-of-the-art neural network classifiers while being more interpretable. An implementation of the model and reproducibility workflows are available at https://github.com/BioGeMT/DiscrimAlign.
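
The link between alignment scores and binary labels can be sketched in a few lines of Python. The example below scores each sequence pair with a global (Needleman-Wunsch) alignment under a linear gap penalty, passes the score through a logistic function, and sums the resulting log-likelihood over labeled pairs. The choice of global alignment, the linear gap model, the unit slope of the logistic link, the toy sequences, and all function names are assumptions made for illustration; the abstract does not specify these details, and this is not the DiscrimAlign implementation.

```python
import math

def global_alignment_score(a, b, match, mismatch, gap):
    """Needleman-Wunsch score with a linear gap penalty (gap is negative)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(pairs, labels, match, mismatch, gap, bias=0.0):
    """Bernoulli log-likelihood with P(label = 1) = sigmoid(score + bias)."""
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        s = global_alignment_score(a, b, match, mismatch, gap) + bias
        total += log_sigmoid(s) if y == 1 else log_sigmoid(-s)
    return total

# Toy data: two positive (similar) pairs and two negative (dissimilar) pairs.
pairs = [("ACGT", "ACGT"), ("ACGT", "AGT"), ("ACGT", "TTTT"), ("AAAA", "CCCC")]
labels = [1, 1, 0, 0]

ll = log_likelihood(pairs, labels, match=1.0, mismatch=-1.0, gap=-2.0)
print("log-likelihood:", ll)
```

A learning procedure in this spirit would adjust the substitution scores and gap penalty to increase this log-likelihood, so the parameters are tuned for classification rather than estimated from curated alignments.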